Reduced N-Grams for Chinese Evaluation
Authors
Abstract
Theoretically, a language model should improve as the n-gram size increases from 3 to 5 or higher. However, the number of parameters, the amount of computation, and the storage requirement grow very rapidly if we attempt to store all possible n-gram combinations. To avoid these problems, the reduced n-gram approach previously developed by O'Boyle and Smith [1993] can be applied. A reduced n-gram language model, called a reduced model, can efficiently store an entire corpus's phrase-history length within feasible storage limits. A further advantage of reduced n-grams is that they are usually semantically complete. In our experiments, the reduced n-gram creation method, the O'Boyle-Smith reduced n-gram algorithm, was applied to a large Chinese corpus. The Chinese reduced n-gram Zipf curves are presented here and compared with previously obtained conventional Chinese n-grams. The Chinese reduced model lowered perplexity by 8.74% and reduced the language model size by a factor of 11.49. This paper is the first attempt to model Chinese reduced n-grams, and it may provide important insights for Chinese linguistic research.
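The storage problem the abstract describes can be seen on a toy example: the number of distinct n-grams that must be stored grows quickly with n, since almost every longer window is unique. The sketch below simply counts distinct contiguous n-grams in a tiny hypothetical corpus; it illustrates the growth in parameters, not the O'Boyle-Smith reduction algorithm itself.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous n-gram (as a tuple) in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hypothetical toy corpus, for illustration only.
corpus = "the cat sat on the mat the cat ran".split()

for n in range(1, 5):
    distinct = len(ngram_counts(corpus, n))
    print(f"n={n}: {distinct} distinct n-grams")
```

Even here, nearly all trigrams and 4-grams occur only once, so storing all combinations buys little statistical strength at a large storage cost; the reduced-model idea is to keep only the (typically semantically complete) phrases the corpus actually contains.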
Similar Resources
Reduced n-gram Models for English and Chinese Corpora
Statistical language models should improve as the size of the n-grams increases from 3 to 5 or higher. However, the number of parameters and calculations, and the storage requirement, increase very rapidly if we attempt to store all possible combinations of n-grams. To avoid these problems, the reduced n-gram approach previously developed by O'Boyle (1993) can be applied. A reduced n-gram lang...
Machine Translation 2008 Evaluation: Stanford University's System Description
This document describes Stanford University’s first entry into a NIST MT evaluation. Our entry to the 2008 evaluation mainly focused on establishing a competent baseline with a phrase-based system similar to (Och and Ney, 2004; Koehn et al., 2007). In a three-week effort prior to the evaluation, our attention focused on scaling up our system to exploit nearly all Chinese-English parallel data p...
Chinese Poetry Generation with Recurrent Neural Networks
We propose a model for Chinese poem generation based on recurrent neural networks which we argue is ideally suited to capturing poetic content and form. Our generator jointly performs content selection (“what to say”) and surface realization (“how to say”) by learning representations of individual characters, and their combinations into one or more lines as well as how these mutually reinforce ...
Finding the Better Indexing Units for Chinese Information Retrieval
In the processing of Chinese documents and queries in information retrieval (IR), one has to identify the units that are used as indexes. Words and n-grams had been used as indexes in several previous studies, which showed that both kinds of indexes lead to comparable IR performances. In this study, we carried out more experiments to find the better way to index Chinese texts. First, we investi...
Journal: IJCLCLP
Volume 10, Issue -
Pages: -
Publication date: 2005